Add batched RoPE kernel #3095
Conversation
@@ -77,6 +77,48 @@ __global__ void rotary_embedding_kernel(
  }
}

template<typename scalar_t, bool IS_NEOX>
__global__ void batched_rotary_embedding_kernel(
This kernel is almost exactly the same as `rotary_embedding_kernel`, and you can make them the same by adding the `const int64_t* __restrict__ cos_sin_cache_offsets` argument there (it will be a null pointer if it is not set) and then, down below, doing `int64_t cos_sin_cache_offset = cos_sin_cache_offsets ? cos_sin_cache_offsets[token_idx] : 0;`.
`cos_sin_cache_offset` is passed as a pointer, so we don't have a good way to determine whether it's empty without an auxiliary flag, and we also try to avoid runtime branching in kernel code for performance. Agreed that these two kernels are pretty much the same, so I refactored them to avoid too much code duplication.
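For readers following the exchange, here is a pure-PyTorch sketch of the lookup semantics being discussed: per-token `cos_sin_cache_offsets` shifting the index into `cos_sin_cache`, with an offset of 0 when no offsets are supplied. The tensor names and the assumption that the offset is added to the position are taken from the suggested line above, not from the final kernel code.

```python
import torch
from typing import Optional

def rope_cache_lookup(
    positions: torch.Tensor,                                # [num_tokens]
    cos_sin_cache: torch.Tensor,                            # [cache_len, rotary_dim]
    cos_sin_cache_offsets: Optional[torch.Tensor] = None,   # [num_tokens] or None
) -> torch.Tensor:
    # Mirrors the suggested kernel logic:
    #   offset = cos_sin_cache_offsets ? cos_sin_cache_offsets[token_idx] : 0
    if cos_sin_cache_offsets is None:
        index = positions
    else:
        index = positions + cos_sin_cache_offsets
    return cos_sin_cache[index]                              # [num_tokens, rotary_dim]
```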
Do you have a (micro-)benchmark that shows the difference between batched and non-batched to justify the change?
@tterrysun Can you add the micro-benchmarks so we can measure the performance here? You can put them into `benchmarks/kernels/`.
Benchmarking command:
Note that this simulates serving 4 LoRAs; the more LoRAs served, the bigger the difference between the single batched kernel call and multiple non-batched kernel calls, and the majority of the difference should come from the Python side. When serving a single LoRA, they should be equivalent.
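The command itself isn't reproduced above, but a minimal sketch of the comparison being described could look like the following. `rope_lookup` is a stand-in for the real kernels, and the cache layout (one block per LoRA) is an assumption; the point is that the per-LoRA Python loop adds dispatch overhead that grows with the number of LoRAs served.

```python
import time
import torch

# Stand-in for the real ops: a table lookup applied either once over the
# whole batch, or once per LoRA group.
def rope_lookup(positions, cache, offset=0):
    return cache[positions + offset]

num_loras, tokens_per_lora, max_pos, rotary_dim = 4, 256, 4096, 128
device = "cuda" if torch.cuda.is_available() else "cpu"
# Assume the cache concatenates one [max_pos, rotary_dim] block per LoRA.
cache = torch.randn(num_loras * max_pos, rotary_dim, device=device)
positions = torch.randint(0, max_pos, (num_loras * tokens_per_lora,), device=device)
offsets = torch.repeat_interleave(
    torch.arange(num_loras, device=device) * max_pos, tokens_per_lora)

def bench(fn, iters=100):
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

batched = bench(lambda: rope_lookup(positions, cache, offsets))
looped = bench(lambda: [
    rope_lookup(positions[i * tokens_per_lora:(i + 1) * tokens_per_lora],
                cache, i * max_pos)
    for i in range(num_loras)
])
print(f"batched: {batched * 1e6:.1f} us, per-LoRA loop: {looped * 1e6:.1f} us")
```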
benchmarks/kernels/benchmark_rope.py
type=int,
choices=[64, 80, 96, 112, 128, 256],
default=128)
parser.add_argument("--rottery-dim",
parser.add_argument("--rottery-dim", | |
parser.add_argument("--rotary-dim", |
benchmarks/kernels/benchmark_rope.py
seq_len=args.seq_len,
num_heads=args.num_heads,
head_size=args.head_size,
rotary_dim=args.rottery_dim,
rotary_dim=args.rottery_dim,
rotary_dim=args.rotary_dim,
@@ -158,27 +169,30 @@ def __init__(
max_position_embeddings: int,
base: int,
is_neox_style: bool,
scaling_factor: float,
scaling_factors: List[float],
Can we also take in a `float` by itself (and coerce it into a list inside the init)?
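A minimal sketch of the coercion being requested; the class name is illustrative and only the relevant fragment of the init is shown:

```python
from typing import List, Union

class BatchedRotaryEmbedding:  # illustrative name, not the actual class
    def __init__(self, scaling_factors: Union[List[float], float]) -> None:
        # Accept a bare float for the single-scaling-factor case and coerce
        # it into a list so the rest of the code only deals with lists.
        if isinstance(scaling_factors, float):
            scaling_factors = [scaling_factors]
        self.scaling_factors: List[float] = scaling_factors
```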
@@ -107,7 +108,9 @@ def _forward(
query_pass = query[..., self.rotary_dim:]
key_pass = key[..., self.rotary_dim:]

cos_sin = self.cos_sin_cache[positions]
self.cos_sin_cache = self.cos_sin_cache.to(positions.get_device())
Should use `positions.device` rather than `positions.get_device()`.
https://pytorch.org/docs/stable/generated/torch.Tensor.get_device.html
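A small, self-contained sketch of the suggested fix; per the linked docs, `get_device()` returns an integer ordinal (-1 for CPU tensors), while `.device` is a `torch.device` that works for both CPU and CUDA tensors:

```python
import torch

positions = torch.arange(8)        # may live on CPU or CUDA
cos_sin_cache = torch.randn(8, 4)

# positions.get_device() yields an int ordinal (-1 on CPU), which is not a
# reliable argument for .to(); positions.device is valid in both cases.
cos_sin_cache = cos_sin_cache.to(positions.device)
cos_sin = cos_sin_cache[positions]
```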
Hi, I am wondering if this kernel is currently used? I don't see changes in the model code, and I'm not sure where the follow-up PRs are.
Problem: Currently we need to call the rotary embedding kernel once per LoRA request, which makes it very inefficient to serve multiple LoRAs with different context lengths.
Solution: Add a batched rotary embedding kernel. Follow-up PRs will pipe it through.
Testing: Batched kernel tests. Follow-up PRs will add e2e tests.
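As a rough, hypothetical illustration of how the batched kernel could be driven (the cache layout and tensor names are assumptions, not this PR's final API): if the cos/sin cache stores one block of `max_position_embeddings` entries per scaling factor, a batch that mixes requests from different LoRAs can be handled with a single kernel call by passing per-token offsets into that cache, instead of one kernel launch per LoRA group.

```python
import torch

max_position_embeddings = 4096
scaling_factors = [1.0, 2.0, 4.0, 8.0]     # e.g. one per served LoRA

# Hypothetical mapping from each token in the batch to the index of the
# scaling factor its LoRA uses.
token_scaling_idx = torch.tensor([0, 0, 1, 3, 3, 3, 2])

# Per-token offsets into a cache laid out as one block per scaling factor.
cos_sin_cache_offsets = token_scaling_idx * max_position_embeddings

# A single batched kernel call would then take `positions` plus these
# offsets, rather than looping over LoRA groups in Python.
```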